Revising the METU-Sabancı Turkish Treebank: An Exercise in Surface-Syntactic Annotation of Agglutinative Languages
Abstract
In this paper, we present a revision of the training set of the METU-Sabancı Turkish syntactic dependency treebank, composed of 4997 sentences, in accordance with the principles of the Meaning-Text Theory (MTT). MTT reflects the multilayered nature of language by a linguistic model in which each linguistic phenomenon is treated at its corresponding level(s). Our analysis of the METU-Sabancı syntactic relation tagset reveals that it encodes both deep-morphological and surface-syntactic phenomena, which should be separated according to the MTT model. We propose an alternative surface-syntactic relation annotation schema and show that this schema also allows for a sound projection of the obtained surface annotation onto a deep-syntactic annotation, as needed for the implementation of downstream language understanding applications.
Related papers
Morpheme Segmentation in the METU-Sabancı Turkish Treebank
Morphological segmentation data for the METU-Sabancı Turkish Treebank is provided in this paper. The generalized lexical forms of the morphemes, which the treebank previously lacked, are added to the treebank. This data may be used to train POS taggers that use stemmer outputs to map these lexical forms to morphological tags.
ITU Validation Set for Metu-Sabancı Turkish Treebank
The Turkish Treebank (Oflazer et al., 2003; Atalay et al., 2003), created by the Middle East Technical University and Sabancı University, has been available to researchers since 2003 and has been used by many researchers since then (Eryiğit and Oflazer, 2006; Eryiğit et al., 2006b; Eryiğit et al., 2006a; Nivre et al., 2007; Çakıcı and Baldridge, 2006; Buchholz and Marsi, 2006; Yüret, 2006; Wu et al., ...
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
Use of Lexical Statistics for Compound Word Recognition and Segmentation in Turkish
Compound words are a cross-linguistic morphological phenomenon that occurs in all languages. Compound words are widely accepted to be stored in the lexicon, but their constituents need to be accessed during both language learning and production processes. In this study, the use of corpora was investigated for how to differentiate single-stem words from single-word compounds and then how to segment c...
Representation of Morphosyntactic Units and Coordination Structures in the Turkish Dependency Treebank
This paper presents our preliminary conclusions as part of an ongoing effort to construct a new dependency representation framework for Turkish. We aim for this new framework to accommodate the highly agglutinative morphology of Turkish and to allow the annotation of unedited web data, and we shape our decisions around these considerations. In this paper, we first describe a novel syntact...